BERT

Introduction

In the previous section we saw how Transformers work. Note that the architecture we discussed there was a decoder-only Transformer. In this section, we will see how an encoder-only Transformer works. We’ll focus on BERT (Bidirectional Encoder Representations from Transformers), the most famous encoder-only transformer.

Masked Language Modeling

BERT is trained using masked language modeling (MLM). In this task, some tokens are randomly masked and the model learns to predict them. Let’s consider the example sentence shown at the top of the page:

“How are you doing today?”

For training BERT, suppose the word you is masked. The sentence then becomes:

“How are [MASK] doing today?”

The model learns to predict the masked word using the context of the surrounding words. The context is provided by the attention mechanism. Note that unlike the decoder-only Transformers described in the previous section, the encoder-only Transformer can attend to both the left and the right of the masked word. So, in the image above, the model attends to the words How and are, which come before the masked word, and doing and today, which come after it.
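To make the masking step concrete, here is a minimal pure-Python sketch of BERT-style token corruption. The 80/10/10 split (replace with [MASK], replace with a random token, keep unchanged) follows the original BERT recipe; the tiny vocabulary and the function name are illustrative assumptions, not part of any real library.

```python
import random

MASK = "[MASK]"
TOY_VOCAB = ["how", "are", "you", "doing", "today", "?"]

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Apply BERT-style masking: of the selected tokens (15% by default),
    80% become [MASK], 10% become a random token, 10% stay unchanged.
    Returns the corrupted sequence and a dict of positions to predict."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok          # the model must predict this token
            roll = rng.random()
            if roll < 0.8:
                corrupted[i] = MASK
            elif roll < 0.9:
                corrupted[i] = rng.choice(TOY_VOCAB)
            # else: leave the token unchanged (it is still a target)
    return corrupted, targets
```

The loss is computed only over the positions recorded in `targets`; all other positions are ignored.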

Next Sentence Prediction

[CLS] is a special token that is added to the beginning of every input sequence in BERT. There is another special token called [SEP] that is used to separate two sentences. The reason we need these two tokens is that BERT is trained on a second task called Next Sentence Prediction (NSP) in addition to Masked Language Modeling.

In NSP, BERT is given two sentences A and B, and must predict whether sentence B actually follows sentence A in the original text. The input format looks like this:

[CLS] Sentence A [SEP] Sentence B [SEP]

Example:

  • Input: [CLS] I love playing tennis [SEP] It’s my favorite sport [SEP]

  • Label: IsNext (these sentences are consecutive)

  • Input: [CLS] I love playing tennis [SEP] The sky is blue [SEP]

  • Label: NotNext (these sentences are not related/consecutive)

During training, 50% of the time sentence B is the actual next sentence (labeled “IsNext”), and 50% of the time it’s a random sentence from the corpus (labeled “NotNext”).
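That 50/50 sampling procedure can be sketched in a few lines. The helper name and corpus format below are assumptions for illustration; a real pipeline would also tokenize the sentences and add segment embeddings.

```python
import random

def make_nsp_example(sent_a, next_sent, corpus, rng):
    """Build one NSP training example: keep the true next sentence
    half the time, otherwise substitute a random sentence from the corpus."""
    if rng.random() < 0.5:
        sent_b, label = next_sent, "IsNext"
    else:
        sent_b, label = rng.choice(corpus), "NotNext"
    return f"[CLS] {sent_a} [SEP] {sent_b} [SEP]", label
```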

The [CLS] token’s embedding, after passing through all the encoder layers, captures the relationship between the two sentences. This embedding is fed into a classification layer that outputs a probability for whether B follows A.
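As a sketch of that classification layer, assume the final [CLS] embedding is a plain list of floats; the weights, biases, and function name below are illustrative toy values, not BERT's actual parameters.

```python
import math

def nsp_head(cls_embedding, weights, biases):
    """Linear layer followed by softmax over the two NSP classes.
    weights: one weight vector per class; biases: one bias per class."""
    logits = [sum(w_i * h_i for w_i, h_i in zip(w, cls_embedding)) + b
              for w, b in zip(weights, biases)]
    exps = [math.exp(l) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]  # [P(IsNext), P(NotNext)]
```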

Why is NSP useful?

NSP helps BERT understand sentence-level relationships and coherence, which is valuable for downstream tasks like Question Answering, Natural Language Inference, and Text Summarization.

However, later research (like RoBERTa) found that NSP might not be as beneficial as originally thought, and removing it can sometimes improve performance on certain tasks.

Using BERT in Inference

Imagine we are given the following sentence:

“How are … doing today?”

Since there is a blank, the model must predict the word that fills it. Assume the vocabulary contains 10,000 words; BERT outputs a probability for each of them, and the model trained in the previous stage chooses the word with the highest probability. Here, that word is you.
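The final step, turning the 10,000 output scores into a single word, is just a softmax followed by an argmax. Here is a minimal sketch, with a toy three-word vocabulary and made-up logits standing in for the full 10,000:

```python
import math

def predict_masked_word(logits, vocab):
    """Softmax the vocabulary logits for the masked position and
    return the most probable word together with its probability."""
    m = max(logits)                            # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(vocab)), key=probs.__getitem__)
    return vocab[best], probs[best]
```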

The BERT model then fills in the blank with this word.